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Abstract 

Many machine learning tasks can be formulated 
in terms of predicting structured outputs. In 
frameworks such as the structured support vec¬ 
tor machine (S VM-Struct) and the structured per- 
ceptron, discriminative functions are learned by 
iteratively applying efficient maximum a posteri¬ 
ori (MAP) decoding. However, maximum like¬ 
lihood estimation (MLE) of probabilistic models 
over these same structured spaces requires com¬ 
puting partition functions, which is generally in¬ 
tractable. This paper presents a method for learn¬ 
ing discrete exponential family models using the 
Bethe approximation to the MLE. Remarkably, 
this problem also reduces to iterative (MAP) 
decoding. This connection emerges by com¬ 
bining the Bethe approximation with a Frank- 
Wolfe (FW) algorithm on a convex dual objective 
which circumvents the intractable partition func¬ 
tion. The result is a new single loop algorithm 
MLE-Struct, which is substantially more effi¬ 
cient than previous double-loop methods for ap¬ 
proximate maximum likelihood estimation. Our 
algorithm outperforms existing methods in ex¬ 
periments involving image segmentation, match¬ 
ing problems from vision, and a new dataset of 
university roommate assignments. 


1 INTRODUCTION 

Learning the parameters of a Markov random field (MRF) 
or a conditional random held (CRF) is a ubiquitous prob¬ 
lem in machine learning and related fields. Often, the 
parameters are learned via regularized maximum likeli¬ 
hood estimation (MLE) and then prediction is performed 
via maximum a-posteriori (MAP) or marginal inferenc^] 
As the log-likelihood is concave, it can in principle be 


maximized by gradient ascent. However, this requires re¬ 
peatedly computing gradients of the log-partition function, 
which in general is intractable. One can circumvent this 
difficulty by using surrogates for the log-partition func¬ 
tion ( | Sutton & McCallum| |2005| |Ganapathi et al.[ |2008| 
Domke 2013) or by approximating the partition func¬ 
tion using sampling (|Petterson et al.| |2009| |Papandreou & 
|Yuille||20TT) . 

Alternatively, one can avoid likelihoods entirely, and use 
methods such as the structured perceptron or structured 
support vector machines (SVM-Struct) that rely only on a 
MAP solver ( |Collins[ |2002[ |Taskar et al.[ |2004[ |Tsochan- 
taridis et al. 2004 Finley & Joachims, 2008jl. Such meth¬ 


ods can often be quite accurate and are typically faster 
than approximate MLE, since MAP, or relaxations thereof, 
can be performed quickly using sophisticated combinato¬ 
rial solvers. By using such solvers as black boxes, MAP- 
based training methods also offer users an attractive ab¬ 
straction between the learning problem and the optimiza¬ 
tion algorithm. On the other hand, MLE remains a primary 
goal for many practitioners, since it may yield superior pre¬ 
dictive accuracy, offers parameter values with increased in- 
terpretability and statistical properties, and supports test¬ 
time marginal inference. 

In this work, we introduce MLE-Struct, a novel approx¬ 
imate MLE algorithm that also only requires access to a 
MAP solver. We combine Bethe-style convex free energies 
with the Frank-Wolfe (FW) method ( |Frank & Wolfe|[T956[ 
Jaggi 2013|>. A naive application of FW for approximate 


MLE would perform approximate marginal inference using 
repeated calls to MAP, as in the experiments of |Sontag &| 
Jaakkola ( j2007j l, and then use these marginals to perform; 


single gradient step on the parameters. This double-loop al¬ 
gorithm requires a significant number of MAP solver calls, 
especially if very accurate answers are required. Our ap¬ 
proach achieves fast learning by avoiding this costly dou¬ 
ble loop structure. First, we employ a generic reweighted 
entropy approximation technique that yields convex Bethe- 


'in this paper, MAP inference refers to predicting an output 
Y given parameters 0, while MLE learning refers to estimating 6 


given observations (Y 1 , A' 1 ),..., ( Y m , X m ), optionally includ¬ 
ing quadratic regularization. 

























































style surrogate likelihoods for any underlying undirected 
graphical model. Then, we construct a constrained, convex 
dual problem for this approximate maximum likelihood ob¬ 
jective. We demonstrate that the approximate dual problem 
can be minimized efficiently using FW: each of the linear 
subproblems that are solved as part of the algorithm can be 
formulated as separate approximate or exact MAP infer¬ 
ence tasks on each training example. Finally, we introduce 
a technique to accelerate the line search subroutine of FW 
by precomputing certain data-dependent terms. 


We can also use FW to perform test-time marginal infer¬ 
ence using the procedure of Sontag & Jaakkola (2007). 
Therefore, at both train and test time we can interact with 
our underlying problem structure using only a MAP rou¬ 
tine. This allows us to design fast approximate learning 
and prediction algorithms for a wide variety of settings in 
which efficient approximate/exact MAP solvers exist: bi¬ 
partite/general matching and b-matching problems (via the 
max-flow and blossom algorithms), pairwise binary graph¬ 
ical models (via QPBO), planar Ising models with no ex¬ 
ternal field (via a reduction to matching) ([Schraudolph & 
Kamenetsky 2008), among others. 


That is, for Y £ y and A' £ X, 
p{Y\X-9) = 

ex P ('Eiev e iM x , y i) + YlaeA d <y-<t> a {X, F«)) 
Z(X-d) 

These models include both Potts and Ising models, as well 
as log-linear distributions over matchings ©• 

Given M observations (F^,..., Y^ M ^} with cor¬ 
responding feature vectors {XW,..., X^ M ^} where 
p(Y ; 9) is of the form Q, we would like to learn 
9 by maximizing the log-likelihood of the observations plus 
a quadratic regularizer. 

1(9-, y( 1:Af )) = (.{9\X^ m \Y^) - ^||0|| 2 

771= 1 

Here, 

£(6»;X (m) ,F (m) ) = -logF(X (m) ). 

2.1 CONVEX FREE ENERGY APPROXIMATIONS 


We apply our method to learn pairwise binary CRFs and 
distributions over matchings on both bipartite and general 
graphs. Our method provides good predictive performance 
while often solving the approximate MLE problem signifi¬ 
cantly faster with fewer numerical instabilities than other 
approximate MLE methods. We also apply our method 
to a new dataset of housing preferences and roommate as¬ 
signments of university students to predict good freshmen 
roommate assignments. 


2 BACKGROUND AND RELATED 
WORK 


We consider conditional random fields where, in addition 
to samples F^,..., F^ M ) from some discrete space y , 
we also observe feature vectors X^\..., X^ M > £ X (Laf- 
ferty et ak||2001|l. In this case, the conditional probability 


of the m™ sample has the form 


p{yi m )\X^-9) = 


exp((^(A( m ),F( m )),6»)) 

Z(™){X-,9) 


(1) 


where <f> is a vector of sufficient statistics, 9 is a vector of 
parameters, and the partition function is given by 


Z(X^-9) = ^ exp((</>(X( m \F),(?)). 
Yey 


In typical applications, the joint probability distribution 
factors over a hypergraph G = ( V. A) where A is a collec¬ 
tion of subsets of V. For ease of presentation, we assume 
that 


<KX,Y) = {</>i(X,Y i )\ ieV ,(l> a (X,Y a )\ aeA }. 


The central challenge in maximum likelihood estimation is 
computing the partition function Z(X^ m \ 9) for each sam¬ 
ple m at each iteration. In this work, we approximate the 
partition function in order to make the learning problem 
tractable. We begin with the Bethe free energy, a standard 
approximation to the so-called Gibbs free energy that is 
motivated by ideas from statistical physics. The approxi¬ 
mation has been generalized to include different counting 
numbers that result in alternative entropy approximations 
(Weiss et al.[ 12007) . We focus on a restricted set of count¬ 
ing numbers that result in a family of convex reweighted 
free energies. 

The reweighted free energy at temperature T = 1 is 
specified by a polytope approximation T, the hypergraph 
G = (V, A ), an entropy approximation H p , and a vector of 
counting numbers p (henceforth referred to as reweighting 
parameters). 

log F p (r, X- 9) 4 E(t, X ; 9) - H p (t), (2) 


where the energy is given by 

E{t,X\9) = — EE n(Y i )9 i (l )i (X,Y i ) 

i£V Yi 


aGA Y„ 

the entropy approximation is given by 


H p (r) A 


-EE TiiVi) log Ti(yi) 

iev yi 


-EE p a T a (y a ) log 

aeA y a 


T a (y a ) 

I! iea T i(ViY 





















and t is restricted to lie in an outer bound of the marginal 
polytope known as the local (marginal) polytope, 

T = {r > 0 : for all i £ V, ^ t* (Fj) = 1 

Yi 

for all a e A, i e a, Y u ^ t q (F q ) = ^(Fj)}. 

The reweighted partition function is then computed by min¬ 
imizing 0 over T 

Z p (X;9) = exp(— min F p (t,X;0)). 
reT 


Setting p a = 1 for each a £ A recovers the typical 
Bethe free energy approximation. The reweighting param¬ 
eters can always be chosen so that the approximate free en- 


ergy is convex (Wainwright et al. 

2003; )Hazan & Shashuaf 

2012[ Ruozzi & Tatikonda 

2013 

). For example, the tree- 


reweighted belief propagation algorithm (TRW) chooses 
the reweighting parameters so that they correspond to (hy- 
per)edge appearance probabilities of a collection of span¬ 
ning (hyper)trees. 


the graph is divided into smaller subgraphs over which the 
partition function can be efficiently computed exactly or 
approximately. These results are then combined to approx¬ 
imate the true partition function. Because the subproblems 
are typically much smaller, the procedure is quite fast but 


can be inaccurate if the pieces are too small (Ganapathi 


et al. 2008|l. For bipartite matchings, one can obtain an un¬ 


biased but noisy gradient of the log-likelihood by utilizing 
an 0(|F| 4 logF) perfect sampler algorithm due to 


Huber 


& Law (2008). [Petterson et al. (|2009[) use this approach for 


ranking and graph matching problems, but limited them¬ 
selves to 20 vertices, and each observation required its own 
set of samples. |Domke| ( |2013| ) proposed performing MLE 
using a small, fixed number of TRW iterations as part of 
a procedure to estimate the gradient. However, if TRW is 
not converging quickly (i.e., a reasonable solution is not ob¬ 
tained after running for a fixed number of iterations), the re¬ 


sulting procedure can fail to converge. Vishwanathan et al. 


( j2006| proposed improving the convergence in the outer 
loop using accelerated gradient methods. All of the above 
methods rely on a double loop. 


2.2 SADDLE POINT FORMULATION 


3 APPROXIMATE MLE 


We approximate the exact partition function in the MLE 
objective with a reweighted free energy approximation of 
the form This results in the saddle-point problem 


max mm 

6 T (l:M)g7- 


M 


E [{cj>{X^\Y^le) 


-\ogF p {r^\x^-e)\ - -\\e \\ 2 


■ (3) 


Heinemann & Globerson ( 201 1| > investigated unregular¬ 
ized likelihoods of this form for MRFs, and demonstrated 
that convexity of the Bethe free energy guarantees that the 
empirical marginals satisfy a moment matching condition: 
the empirical marginals minimize the Bethe free energy 
at the 6 that maximizes the approximate log-likelihood. 
Moment matching is not necessarily achieved for general 
MRFs when the reweighted approximation is not convex 
(Heinemann & Globerson 201 1) . Wainwright et al. (20031 
investigated the use of TRW for learning in pairwise bi¬ 
nary graphical models. They observed that the parameters 
learned via TRW were more robust to the addition of new 
data than those learned by BR This robustness of convex 
free energy approximations for learning can be made theo¬ 
retically precise (Wainwright] [2006} . 

If we compute the partition function via an iterative proce¬ 
dure, then solving ([3]) necessarily requires a double-loop al¬ 
gorithm, which can be expensive for large datasets. The ex¬ 
isting work on Bethe learning has sought to design more ef¬ 
ficient approximate learning algorithms. |Sutton & McCal-| 
lum (2005 1> proposed a piecewise training scheme whereby 


We now consider a convex dual reformulation of ([3) that 
applies to convex free energies and yields a new, fast learn¬ 
ing algorithm. We first note that ([3]i is concave in the vari¬ 
ables being maximized and convex in the variables being 
minimized, and one set of variables (the r) are constrained 
to a compact domain. We can thus invoke Sion’s minimax 
theorem (Sion, 1958) > to reverse the max and min opera¬ 
tors. Next, we can analytically solve for the optimal 9 in 
terms of fixed t^,... £ T. Setting the gradient 

with respect to 9 equal to zero in ([3]i yields 









(4) 
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ri m \Ya)MX {m \Ya) 


(5) 


Finally, substituting these back into ([3]) yields the following 
optimization problem over the local marginal polytope. 

min T||r( r (1:M) )|| 2 _V fT p (r (m) ) 

T (i:M )er 2A" v p v > 

m 

= min L( r ( 1:m )) ( 6) 

T (l:M) g7 - 

The linearly-constrained convex objective ([6) can be mini¬ 
mized via general convex optimization techniques, such as 































































the ellipsoid method—though this can be slow in practice. 
In the sequel, we minimize this objective with the Frank- 
Wolfe algorithm (FW). 

Convex free energies can be obtained using the reweight¬ 
ing techniques above, and whenever the graph is a tree the 
standard Bethe free energy is both convex and exact. In 
principle, a similar argument can be made for any convex 
approximation of the partition function, though we only fo¬ 
cus on convex Bethe-style approximations in this work. 


The MLE objective is dual to the maximum entropy prob¬ 
lem, and recent work on approximate MLE has focused on 
different families of entropy approximations ([Wainwright 
|& Jordan] |2QQ8[ >. |Ganapathi et al,| ( |2008| ) also followed 
a maximum entropy approach to the approximate MLE 
problem. They proposed approximating the entropy objec¬ 
tive using the Bethe entropy approximation (i.e., Hi), but 
specifically avoided convex entropy approximations. Un¬ 
fortunately, this results in a non-convex optimization prob¬ 
lem in general, for which the authors use the concave- 
convex procedure. Other recent work approximated the 
MLE problem using convex free energies, but did not con¬ 
sider the maximum entropy approach (Domke 2013}. 


3.1 FRANK-WOLFE ALGORITHM FOR 
MAXIMUM LIKELIHOOD LEARNING 


Following Jaggi ( j2013| , FW minimizes a general convex 
function /( x) over a convex set X via a sequence of iterates 
defined by 


St = arg min (a;, X7 f(x t -i)} 

(7) 

x t = (l- 7t)x t -i + 7 tSt, 

( 8 ) 


where the step-size, 7 t , is either selected using line search 
or is fixed at 277 . 

For the objective function in ([ 6 ]), each step requires mini¬ 
mizing a linear objective over a linear set of constraints. 

s t = argmin r (i) > )r(M ) Gr <T (1:m) , VL(r^ : 1 m) )) (9) 


Since the constraints are separable across training ex¬ 
amples, decouples into M independent linear pro¬ 
grams (LPs) that can be solved in parallel. Depending on 
the specific application, purely combinatorial methods or 
reweighted message-passing algorithms may provide faster 
and more space efficient alternatives to generic LP solvers. 
Note that ([ 6 ) could not be solved with projected gradient 
algorithms efficiently, since projection onto the local or 
marginal poly topes is not tractable. 


Despite the ability to perform |9]) in parallel, it can still be 
prohibitive for large sample sizes. For convex optimiza¬ 


tion problems over separable constraint spaces, Lacoste- 


|Julien et al.| ( |2013} propose a block-coordinate FW algo¬ 
rithm (BCFW). The BCFW procedure performs the FW 


Algorithm 1 MLE-Struct: Frank-Wolfe Approximate 
Maximum-Likelihood Learning 

Input: training examples Y^}, reweighting 

parameters p £ [0, l] n , regularizer A 
Output: Approximate maximum likelihood 9. 
Initialization: Set each uniformly. 

repeat 

for Vto in parallel (batch) or to chosen uniformly at 
random (block) do 

4 m) = arg min T ( M) £ 7 -(rW, V (m) L(rf : 1 m) )) 

Set 7 = 5 Tt f° r batch and 7 = 2 M+ t f° r block or 
use line search. 

end for 

until converged 
Set 6 using Q and ([5J. 


iteration over a randomly selected to £ {1 ,,M} and 
leaves the remaining coordinates untouched. BCFW re¬ 
quires less work at each iteration, but the asymptotic rate of 
convergence remains the same as that for the standard FW 
algorithm (Lacoste-Julien et al. |2013} . This block coordi¬ 
nate approach is known to outperform FW for the SVM- 
Struct problem. Technical details concerning the conver¬ 
gence of FW for this problem, including methods to bound 
the convergence rate, can be found in Appendix|B] Both the 
FW and BCFW versions of our algorithm, MLE-Struct, are 
described in Algorithm[l]Line search can be accelerated by 


precomputing quadratic terms as discussed in D.2 


3.2 FRANK-WOLFE FOR MARGINAL 
INFERENCE 


In many applications, computing marginals is useful at test 
time, as well as during learning. Fortunately, we can use 
FW to perform marginal inference, thus maintaining our 
ability to interact with the underlying model only through 
a MAP solver. Specifically, approximate reweighted Bethe 
marginals can be obtained by minimizing | 2 ) with respect 
to r, which is a convex problem suitable for FW. The tech¬ 
nique was first used in Sontag & Jaakkola ( 2007} , using a 
generic LP solver for MAP. 

In Appendix [F| we provide experiments on CRFs defined 
over bipartite matchings (see Section |4. 1 } , demonstrating 
the favorable accuracy and speed of FW-based inference 
versus the BP algorithm of |Huang & Jebara ( j2009j ) and 
an instance of the Perturb-and-MAP framework designed 


specifically for matchings (Li et al. 2013). We find that FW 


outperforms Perturb-and-MAP in terms of both accuracy 
and convergence speed. FW and BP minimize the same 
objective, since the Bethe entropy is convex for matchings, 
so we compare them purely in terms of speed. We find 
that FW is preferable to BP in most regimes, except when 
















































extremely precise optimization is required. 

4 APPLICATIONS AND EXPERIMENTS 

We apply the MLE-Struct framework to a variety of expo¬ 
nential family models defined over different combinatorial 
structures, including grid CRFs for image segmentation, bi¬ 
partite matchings in vision applications, and general per¬ 
fect matchings for a university roommate assignment prob¬ 
lem. For CRFs, MAP inference is intractable, but we can 
efficiently solve the LP relaxation, which is equivalent to 


MAP inference over the local polytope with QPBO (Rother 


et al., 2007j l. This means our estimated pseudomarginals 
will not be globally consistent, but the procedure can still 
yield accurate predictions (Wainwrigh t et al.| 2003| l. For 
the matching problem, we use efficient max-flow solvers to 
obtain exact MAP solutions (i.e., over the marginal poly¬ 


tope) (Goldberg & Kennedy 1995| Kolmogorov| 2009) . In 
this case, our estimated pseudomarginals will be globally 
consistent. Appendix [A] details the data sources, feature 
extraction, and machine setup. 


the conditional likelihood is 

P( Y ;F'«,e>= ^ z{Fl:K e) -' (ID 

Theorem 4.1. For any p £ [0,1]I' I, any graph (bipartite 
or general), and any matching (perfect or imperfect), the 
reweighted free energy is convex over the local poly¬ 
tope. 

Theorem |4.1| is proven in Appendix [C] By inclusion, it im¬ 
plies |2]) is also convex over the marginal polytope. This 
generalizes earlier known results of convexity for bipartite 


perfect matchings (Vontobel 2013 Chertkov & Yedidia 


2013). Due to the convexity of the Bethe entropy and 


the availability of high-quality maximum-weight matching 
solvers, Algorithm[T]is well-suited to the approximate MFE 
task. A derivation of the specific form of <01 for matchings 
and a technique for making the associated line search par¬ 
ticularly efficient by precomputing certain data-dependent 
terms can be found in Appendix |D| 

4.1.1 SYNTHETIC BIPARTITE MATCHINGS 


4.1 PERMANENTS AND PERFECT MATCHINGS 


We first consider the problem of learning distributions over 
perfect matchings of a given graph. For a graph G = 
(V, E ) and edge weights Wij £ R, the probability of ob¬ 
serving a particular matching is 

f(Y',w) = exp tr(TVF)) (10) 


where Y is the adjacency matrix of a perfect matching in G, 
W is a weighted adjacency matrix of G, and Z(W) is the 
partition function. Each entry of W is a function of edge¬ 
wise features. Our formulation can be relaxed to distribu¬ 
tions over all matchings by allowing Y to correspond to the 
adjacency matrix of any (not necessarily perfect) matching. 


When G is bipartite, the partition function is the perma¬ 
nent of the matrix of edge weights and is thus #P-hard to 
compute (Valiant 1979| . Although the partition function 
can be computed to any given accuracy using a fully poly¬ 


nomial randomized approximation scheme (Jerrum et al. 


2004), such algorithms are impractical for graphs of any 


significant size. 


In practice, W is unknown and must be learned from data. 
We can learn a generative model by estimating W directly, 
or a conditional model by first assuming that W is the lin¬ 
ear combination of some feature maps and then learning 
the weights. For concreteness, suppose we have K fea¬ 
tures, and for the feature we have a |Vj x |V| matrix 
Fk. Let 0 £ R K be our model parameters, so that the 
weight on edge (i,j) £ E is W tJ = J2k=i ®kF fj. Then 


We begin with a synthetic experiment using the flexibil¬ 
ity of MLE-Struct to analyze the accuracy of various en¬ 
tropy approximations for matchings. We sample 10 x 10 
bipartite matchings from the distribution ©• We explore 
two choices of the weight matrix W: one in the high SNR 
regime with —2 on the off-diagonals and 0 on the diago¬ 
nals, and one in the low SNR regime —0.5 off-diagonal and 
0 on-diagonal. 


Our problems are small enough that we can compute exact 
partition functions and their gradients with Ryser’s algo¬ 
rithm (Ryser |1963} >. Hence, we can perform exact MLE 
with gradient descent. We can also evaluate the true (reg¬ 
ularized) likelihood of our estimates. We ran Algorithm |T| 
with p = 1 and called this result the Bethe estimator. In 
addition, setting p = 0.5 for this problem guarantees a 
concave entropy approximation and an upper bound on the 
partition function (Meltzer et al., 2009). We also ran Al¬ 
gorithm [I] at this setting and denote the result as the RW 
estimator. 


Figure |T] a) displays the average regularized log-likelihood 
of each estimator, higher being better and the Exact curve 
being an upper bound. In both low and high SNR 
regimes, the Bethe estimator is superior to the RW estima¬ 
tor. Reweighted entropies such as the one chosen here are 
known to perform poorly as estimators of the true partition 
function as compared to belief propagation. Interestingly, 
although the objective values of the estimates are different 
in each case, in the low SNR regime, all estimation meth¬ 
ods produce about the same likelihood. 

Our framework can also be used to bound the value of 
the true likelihood. First, since Z K vv provides an upper 
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Samples 

(a) High SNR problem : Average (divided by sample size) true 
regularized log-likelihood evaluated at the exact MLE as well as 
at the parameters that maximize the RW and Bethe approximate 
log-likelihoods. Higher is better, with the Exact curve being the 
upper bound. The data are nearly as probable under the Bethe 
estimator as they are under the exact MLE. 



(b) Low SNR problem: Same plot as (a). All methods perform 
comparably in this setting. 



Samples 


(c) High SNR problem: Optimal values of regularized true, and 
approximate log-likelihoods under the RW and Bethe approxima¬ 
tions. The true likelihood is always bounded by the approxima¬ 
tions. 



Samples 

(d) Low SNR problem: Same as (c) 


p 

.5 

.6 

.7 

.8 

.9 

1 

Test Loss 

0.013 

0.013 

0.017 

0.017 

0.020 

0.017 


Figure 2: Test loss for approximate MLE using FW on the 
house data set with a gap of 50 and different choices of 
uniform p vectors. 



Figure 3: Top: Approximate MAP assignment produced 
from the learned model. Red edges were incorrectly 
matched. Middle: Pseudomarginals for a correctly pre¬ 
dicted edge. The correct edge has high probability (red) 
while all others have low probability (blue). Bottom: Pseu¬ 
domarginals for a wrongly predicted edge. There are two 
edges with nontrivial probability (red and green). When the 
model is forced to pick one, it picked the wrong one. 


for all W. Moreover, 4w and £ B are global bounds on the 
maximum likelihood, so the inequalities also hold at their 
respective optima. That is. 


Figure 1: Exact likelihoods of the Bethe and RW estimates, 
and sandwich bounds on the likelihood. 


log4w(W^ w ) < \og£{W*) < log £ B {W* B ) (13) 


bound on Z (Wainwright et al. 2003), we have Z RW (W) — 
Z 0 5 (W) > Z(W) for any W. Second, for matchings, we 
have Z B (W) = Z-, (W) < Z(W ) < |Gurvits| 2011). There¬ 
fore, we have 


log £ rw (W) < log £(W) < log £ b (W) (12) 


where Wj£ w is the RW estimator, W* is the regularized 
MLE, and W B is the Bethe estimator. We plot the quan¬ 
tities of ( fl3| in Figure |T] b). We can also use ( [T2| > to ob¬ 
tain upper and lower bounds of £(Wg) and ^(W^ w ) by us¬ 
ing FW for inference to compute £ B {W£ SN ) and tm/(W B ), 
since £ B {W B ) and 4w(W^ w ) will already be available 
upon convergence of Algorithm |T] Appendix [A] shows 
these results. 








































Hotel House 



FW 

lin.+l. 

FW 

lin.+l. 

0 

0.0 

0.0 

0 

0.0 

10 

0.0 

0.0022 

0.0040 

0.0 

20 

0.0049 

0.0049 

0.0022 

0.0022 

30 

0.020 

0.020 

0.0049 

0.0 

40 

0.023 

0.013 

0.0 

0.0 

50 

0.0614 

0.050 

0.017 

0.022 

60 

0.13 

0.12 

0.041 

0.051 

70 

0.17 

0.15 

0.051 

0.067 

80 

0.24 

0.19 

0.080 

0.14 

90 

0.33 

0.30 

0.12 

0.11 


Figure 4: Test loss versus gap between frames for the ho¬ 
tel/house data measured via the Hamming loss for both 
MLE-Struct and the method of Caetano et al. (2009) 


4.1.2 GRAPH MATCHINGS 


We now apply the bipartite matching model to a graph 
matching problem arising in computer vision over the 
CMU house and hotel image sequences. We follow the 
setup of Caetano et al. ( 2009) . The data consist of 111 
frames of a toy house and 101 separate frames of a toy ho¬ 
tel, each rotated a fixed angle from its predecessor. Each 
frame was hand-labeled with the same 30 landmark points. 
We consider pairs of images at a fixed number of frames 
apart (the gap), which we divide into training, validation, 
and testing sets following the same splits as Caetano et al. 
( [20091 1. We measure the average Hamming error between 
the predicted matching (MAP estimate using our learned 
parameters) and the ground truth. 

We compare our algorithm against the linear+learning 
method of Caetano et al. ( j2009) >, which fits the parameters 
of a linear model using the same features as our algorithm 
but with a hinge loss objective. The results of the exper¬ 
iments with reweighting parameters p = 1 are described 
in Figure |4] For each subsequence, we chose the regular¬ 
ization parameter via cross-validation. Both methods per¬ 
form comparably, with our method doing slightly better on 
the houses and the method of Caetano et al. ( |2Q09] > doing 
slightly better on the hotels. We also compared the perfor¬ 
mance of our algorithm with different reweighting param¬ 
eters p. Figure [2] shows the results for the house data when 
the gap is 50 for various choices of p. We observed little 
difference in test error as p varies: this was confirmed over 
synthetic as well as real data. As a result, we did not tune 
p for different data/problem setups. 



Figure 5: ROC curves for roommate matching of our algo¬ 
rithm and a constant baseline. 


Profile Item 

Weight 

Smoking 

-0.0484 

Personality 

-0.0370 

I generally go to bed at... 

-0.0296 

I generally wake up at... 

-0.0218 

Study with audio/visual 

-0.0133 

Overnight Guests 

-0.0097 

Cleanliness 

-0.0056 


Table 1: Largest t\ distance features of roommate survey 
data. More negative values increasingly discourage fea¬ 
tures from differing. 


Algorithm |T| permits fast and simple approximate MLE in 
problems where it was previously quite difficult. Namely, 
we found that using standard BP approaches (Huang & Je 


bara 2009| Bayati et al.[ 2011) to compute marginals for 


the standard double-loop MLE approach was very unsta¬ 
ble for this problem because BP often failed to converge 
after taking several gradient steps. In later sections, we 
juxtapose Algorithm [T] with alternative approximate MLE 
approaches. 


4.2 UNIPARTITE PERFECT MATCHINGS 

Many undergraduate institutions assign first-year students 
to roommates based on questionnaire responses, but allow 
returning students to pick their own roommates in subse¬ 
quent years. Therefore, we can use observed roommate 
matchings of returning students to train a model for stu¬ 
dents’ preferences. Such a model can then be used by the 
administration to assign first-year students roommates that 
they will get along with. 


Figure [3] illustrates one advantage of learning a proba¬ 
bilistic model over a discriminative model: the pseudo¬ 
marginals indicate the model’s confidence in a prediction. 
In many cases, when the algorithm made the wrong pre¬ 
diction, two edges incident to a specific node had relatively 
high pseudomarginal probabilities. In these cases, the er¬ 
rors were not completely unfounded. Similar image parts 
were matched, albeit incorrectly. 


We obtained an anonymized dataset of campus housing 
room assignments and questionnaire responses for under¬ 
graduate students at a major US institution for the years 
2010-2012. We used data from 2010 and 2011 to train and 
data from 2012 to test. We prune those students who did not 
live in campus housing or were assigned to single rooms 
(there were no rooms with three or more residents). The 
remaining students were thus assigned to one roommate. 
























































and form perfect matchings in complete graphs of 2374- 
2504 nodes. As our data includes neither year nor gender, 
we treat the entire matching assignment for one year as one 
observation. 

Our questionnaire data consists of 2 binary features and 12 
ordinal features of 5 levels each. For each pair of students 
and each questionnaire question, we created one feature of 
absolute differences and many interaction features, which 
consisted of one indicator feature for each possible pair of 
answers to the questionnaire questions. For simplicity, we 
assumed symmetric interactions. For each student pair, the 
weighted score for their matching is a linear combination 
of features. 

We fit a model using MLE-Struct. Table [T] lists the largest 
coordinates of 9 for distance features ordinal on ordinal 
questionnaire responses. Here, more negative values indi¬ 
cate closer agreement required. We see that smoking, per¬ 
sonality (introverted vs extroverted), and sleeping habits re¬ 
quire the strongest agreement. For more results and details 
of the feature, see Appendix |A| 

As we are effectively performing multiclass classification 
of k, 2500 classes with only ss 14 features, we do not ex¬ 
pect high accuracy in terms of Hamming error. Instead, 
we consider the use-case where we use the model to re¬ 
ject very bad roommate assignments. To evaluate this, we 
use our learned 9 to form the cost matrix from features of 
the test year (2012), and use the entries of this cost ma¬ 
trix as scores for a binary classifier. We then plot ROC 
curves in Figure [5] where we demonstrate gains above ran¬ 
dom guesses and a constant baseline where 9 = — 1 for 
distance features and 0 for interactions. In particular, our 
algorithm dominates in the low false-positive regime of the 
graph. As competitors, we also evaluated a structured per- 
ceptron and structured SVM using the same MAP decoder, 
but even after extensive parameter tuning, they did not gen¬ 
eralize well to test data, and obtained test AUCs worse than 
the constant baseline. 


4.3 GRID CRFS 


Next we study a binary image segmentation problem on 
the Weizmann horses dataset ( |Borenstein & Ullman[|2002] i. 
We formulate this as a pairwise binary model with a vari¬ 
able for each pixel indicating whether it is part of a horse. 
We used the same features and resized images obtained 
from Domke ( 2013j > and kept their split of 200 training and 
128 testing images. This results in grid CRFs of approxi¬ 
mately 10,000-40,000 nodes. 


In initial experiments, we tried a naive double-loop method 
of subgradient descent on 9 using belief propagation to 
compute subgradients of the TRW log-partition function. 
This method was too slow and unstable for even small real 
problems. Instead we compared our algorithm against the 




(a) Raw test image (#207) and ground truth segmentation. 

MLE-Struct domke40 


9.2 min. 
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Domke 


(b) Estimated marginals of our algorithm and that of 
( |2013[ > on image #207 after various amounts of training time. Our 
algorithm terminated after 3.7 hours while that of [Domkej l j2013j l 
took nearly twice as long to converge. 


Figure 7: Visual results of the horse dataset. 


method proposed by |Domke (20131, which solves Q by 
using a small, fixed number of iterations of TRW to approx¬ 
imate the partition function and thereby the gradient of of 
the approximate likelihood. The outer loop runs L-BFGS 
until convergence. Limiting the number of inner TRW iter¬ 
ations is key to this algorithm’s efficiency, which burdens 
the user with another tuning parameter. 

In Figure[6j we compare two variants of our algorithm with 
three variants of the algorithm of Domke (20131 in terms 
of objective value and test error over time. Both meth¬ 
ods optimize the same objective—we optimize the dual 
while they optimizes the primal. We set A = 3420 in our 
parameterization which matches the setting for their pub¬ 
lished result. The MLE-Struct curves result from apply¬ 
ing the block-coordinate version of our Algorithm [I] We 
obtained the fastest convergence without using linesearch. 
MLE-Struct-wavg uses the same algorithm, but evaluates 
the test error using a weighted average of the iterates as de¬ 
scribed by Lacoste-Julien et al. ( |2013 i. The curve for aver¬ 
aged iterates is substantially smoother than the raw MLE- 
Struct curve, and very quickly attains a low test error. The 
domke.x' curves result from running their algorithm for x 
TRW inner loop iterations. This method is not guaranteed 
to converge to the global optimum for any finite x. A prac¬ 
titioner must run the algorithm for a sequence of increasing 
values of x to confirm convergence to the correct value. In 
contrast, our FW algorithm requires only a single run and 
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(a) Plot of the difference to the midpoint of the best values attained 
by each algorithm (duality gap). The domkelO curve intersects, 
but does not converge to the correct value. 
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(b) Test error (mean Hamming loss) vs time. MLE-Struct-wavg 
curve traces the error of our method with averaged iterates, which 
achieves the lowest test error and gets these fastest. 


Figure 6: Horse dataset performance measures. MLE-Struct is our method; legend is same across both plots. 


computes an upper-bound on the duality gap as a byprod¬ 
uct (Jaggi 2013[ >. 

In early iterations our algorithm achieves the lowest objec¬ 
tive value and test error. Our algorithm attains low test error 
even when the objective value is relatively far from optimal. 
This phenomenon is a result of the dual formulation: we it¬ 
eratively move t to minimize the objective, but for each 
value of r, we compute the optimal 9 as a linear function 
of r. While r may initially be very inaccurate, contribut¬ 
ing to a large objective value, 9 is much lower dimensional, 
so some of the errors may cancel in computing 9, resulting 
in good predictions nevertheless. In contrast, the method 
of Domke ( 2013) 1 iteratively moves 9, and for each value 
of 9, computes the optimal value of r using TRW. Prior to 
convergence, this may enable better fit to the training data 
at the expense of accurate estimation of 9. 


The effect is readily apparent when visualizing predicted 
marginals on the test set in Figure [7] The domke40 method 
takes 9.2 minutes to complete one iteration. Parameters 
after one iteration are nowhere near the MLE, as evidenced 
by the first row of[7jb) and the early portion of the objective 
value plot^a); the mean Hamming loss on this sample is 
0.249. Their marginal estimates at this point have only used 
local intensity data: light regions are classified as “horse” 
and dark regions are classified as “not horse.” In contrast, 
in about the same time, our BCFW method already ran for 
12000 iterations, has made 60 passes over the training data, 
and essentially recovers the correct segmentation (except 
for difficult portions on the mane and hind legs, where the 
background texture is confusing) with mean Hamming loss 
of 0.068. After 3.7h, both algorithms produce comparable 
visualizations, though we get 0.074 normalized Hamming 
error on this sample, while domke40 gets 0.109. It takes 
8.2h for the domke40 to converge according to its internal 
criteria, attaining a final test error of 0.0813 on this image. 


5 DISCUSSION AND CONCLUSION 

In maximum likelihood estimation of discrete exponen¬ 
tial family models, replacing the Gibbs free energy with 
a convex free energy approximation leads to a concave- 
convex saddle point problem. We have shown that adding 
a quadratic regularizer enables a closed-form maximiza¬ 
tion, leaving a single convex minimization problem, which 
can be solved efficiently using the Frank-Wolfe algorithm. 
We can scale to large datasets by using block-coordinate 
Frank-Wolfe, and rapidly achieve low test error by solving 
the dual objective. This method accomplishes approximate 
MLE using a simple wrapper around a black-box MAP 
solver. Previously, practitioners either employed expen¬ 
sive double-loop MLE procedures or they abandoned MLE 
by resorting to structured SVMs and perceptrons. Our 
method is competitive with max-margin MAP-based esti¬ 
mation methods in terms of prediction error and faster than 
competing MLE methods in minimizing test error, while 
being simple to implement. In future work, we will extend 
the method to other combinatorial problems, incorporate 
structure learning with t\ regularization, and handle latent 
variable models as a single minimax problem. 
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Samples Samples 

(a) Exact and approximate likelihoods evaluated at the parameters (b) Exact and approximate likelihoods evaluated at the parameters 
that maximize the Bethe log-likelihood. that maximize the RW log-likelihood. 

Figure 8: Sandwich bounds on the likelihood at the Bethe and TRW estimators. Top is the high SNR problem, bottom is 
the low SNR problem. 


Supplementary Material 

The following material is used to provide details about techniques employed in the paper. Part [A] presents details of 
experimental setup and additional results. Then in part[B]we discuss the convergence properties of FW for our problems. 
In part[C]we prove convexity of the Bethe free energy for general matchings. Part[D]derives an instance of our Algorithm[I] 
for matchings, and part[E]derives the same for linear CRFs. 

Then, we present a full derivation of the dual objective and line search objective for doing FW learning for graph matchings. 
This is presented in terms of matrix-valued terms (various weighted adjacency matrices for the graph), which facilitates 
easy implementation, since MAP solvers operate on such matrices. 


A Details on Experiments 

In this appendix, we explain further details of experiments. 

A.l Synthetic Experiments 

For matchings, the TRW likelihood lower bounds the true likelihood, while the Bethe likelihood upper bounds the true like¬ 
lihood at all parameter values. Thus, we can obtain two-sided bounds on the likelihood of W^ RW by evaluating Pb{Wj RW ), 
and of the likelihood of Wg by evaluating ^TRW (Wg). Figure [8]displays these results. The approximate likelihoods were 
computed using FW for inference, using the procedure described in Sec. |3.2| while the exact likelihood was computed using 
Ryser’s algorithm, which was feasible for this small problem. 


A.2 Horse Experiments 


Timing experiments were performed a dedicated 8 core (16 hyper-threads), 2.67GHz Intel Xeon X5550 machine with 
24 GB of physical RAM, running Ubuntu 12.04.5 LTS and Matlab R2012b. Computations were restricted to a single 
core, and at most two experiments were run at a time. Our algorithms were implemented in Matlab, interfacing with 
combinatorial solvers written in C or C++. We downloaded the code for ( |Domke[ |2013j > and ( |Caetano et al.| |2009| > from 
the authors’ websites, which were implemented in C++ and Matlab with C extensions, respectively. We obtained the 
original experiment scripts for (Dornke} 2013j) through correspondence with the author. 
































Profile Item Description 

Weight 

Rank 

I generally go to bed at... 

-0.0296 

3 

I generally wake up at... 

-0.0218 

4 

Rising Sophomore 

0 

10 

Cleanliness 

-0.0056 

7 

Smoking 

-0.0484 

1 

Sleeping Habits 

-0.0025 

8 

Overnight Guests 

-0.0097 

6 

Personality 

-0.037 

2 

Usual Study Hours 

0.0131 

12 

Study Location 

-0.0006 

9 

Study with audio/visual 

-0.0133 

5 

Single-sex floor (1) 

0.2277 

14 

Single-sex floor (2) 

0.0518 

13 

Allow in Brownstone 

0 

10 


Figure 9: Distance features for the roommates experiments, their learned weights, and their relative importance ranking 
with regularization A = 100. 

A.3 Roommate Experiments 

This dataset was obtained from a major US university over a three year period from 2010-2012. The anonymized dataset 
consists of roommate assignments for pairs of students in each of the three years. In addition, each of the students was 
required to complete a brief housing survey that asked for their preferences in terms of cleanliness, sleeping schedule and 
habits, personality, study preferences, etc. Our questionnaire data consists of 2 binary features and 12 ordinal features of 5 
levels each. For each pair of students and each questionnaire question, we created one feature of absolute differences and 
several interaction indicator features, one for each possible pair of answers to the questionnaire questions. For simplicity, 
we assumed symmetric interactions. For each student pair, the weighted score for their matching is a linear combination 
of features. The learned weights for each distance features and their relative rankings are described in Figure [9] As we 
are using a log-linear model, the weights should be interpreted as log-odds ratio for a unit increase in absolute distance 
(assuming ordinal features). 

A few qualitative observations about these results. First, single-sex floor, rising sophomore , and allow in Brownstone all 
received relatively low weights, indicating perhaps that the data was too noisy with respect to these survey responses. 
Second, personality, smoking, and bedtime were among the strongest predictors of a successful match while cleanliness, 
study hours, study location were among the least important. 

When comparing to the BCFW algorithm for a structured S VM, we employed the publicly-available code from the authors. 
We tried regularization parameter lambda values in the range [10e-4, 10e2]. Our best-performing configuration achieved 
an AUC of .504 and a hamming error of .9992, which outperforms random guessing, but significantly underperforms a 
model trained with MLE. 


B Convergence of Frank-Wolfe 

Recall the following approximate max-entropy objective function. 

i(T(1:m)) = ^n r (t ( 1 :M) )ii 2 - E #pE m) ), 

m 

In this appendix, we discuss the convergence rate of the FW algorithm for the minimization of the convex function L over 
the local polytope. |Jaggi~|(2013j> has shown that the suboptimality of the iterates of FW decays as 


L(r t )-L(r*)<^(l + 6), 


(14) 


where S is the accuracy to which each of the linear subproblems (i.e., the optimization problem performed at each iteration) 
is solved and Cl is the curvature of the function L. 










Curvature is a stronger notion of the function’s geometry than its Lipschitz parameter, since it is affine-invariant, like the 
entire Frank-Wolfe algorithm (Jaggi 2013f. The curvature of a differentiable function F : X —> M is given by 


C F = sup ~^{F(y)-F(x)-{y-x,VF(x))). 

x,x'&X, 7G[0,1] 7 
y—x+'y(x'-x) 


(15) 


The curvature, Cp, quantifies how much F can differ from its linearization. For twice differentiable functions, the curvature 
can be upper bounded as follows. 

C F < sup hy - x) T \7 2 F(y - x) 

x,x'&X, 76[0,1] ^ 
y—x+^/(x'-x) 


For our objective function, the curvature is most heavily influenced by the entropy term as the curvature of the quadratic 
piece is simply a constant that depends on the (f>’s. Unfortunately, the curvature of L is unbounded: as one approaches 
integer points in the local polytope, the entropy approximation becomes arbitrarily steep. Therefore, the worst-case con¬ 
vergence rate of FW for this problem is unbounded. In order to obtain a bounded curvature, we could strengthen the box 
constraints in the marginal polytope by requiring 


T i{yi) e [ 77,1 - 77 ] for all ieV,y l ey 
T a {y a ) e [ 77,1 - 77 ] for all a £ A,y a £ 

for some 77 £ (0, .5). For each the part of the entropy approximation that depends on r,):;;,) looks like 

1 - ^2 Pa\ Ti(x,) log Ti(Xi). 

V aDi ) 


That is, all we need to do is compute bounds on the curvature for functions of the form f(x) = c • x log x. For this function, 


Cf< 


c(x' — x) 

2 y 


for any y £ [x,x f ]. To bound this quantity from above, we can pick x = 77 , x' = 1 — 77 , and y = 77 . This gives 


Cf< 


c( 1 — 2y) 2 
2 V 


which is a fixed constant and tends to zero as 77 tends towards .5. Hence, as long as the optimal pseudomarginals lie strictly 
inside [0,1], then this modified FW is always guaranteed a linear rate of convergence to the optimum. The pseudomarginals 
produced by both BP and RBP often lie strictly inside the box constraints, so this is typically not an issue in practice. In 


order to find an appropriate 77 , we could use, for example, the bounds on the pseudomarginals proposed by Mooij & 


|Kappen| ( |2007| >. These bounds are obtained by running BP/RBP for a fixed number of iterations. An appropriate 77 can then 
be selected such that the interval [ 77 ,1 — 77 ] contains all of the bounds as in Mooij & Kappen (2007). 


Adding additional box constraints is expensive (we can no longer use the combinatorial algorithms that we used for match¬ 
ing and pairwise binary MRFs), and they may not be necessary in practice. For pseudomarginals r, the steepness only 
becomes unmanageable when there are components of r that are close to 0 or 1 (in a Gibbs distribution, no clique as¬ 
signment ever has zero probability, so VT is always well-defined). If the iterates of the algorithm never get too close to 
the boundary of T, then the effective curvature term will be reasonable. Of course, if the optimal pseudomarginals of 
the reweighted approximation are close to the boundary of T, then Cl will be large in the neighborhood of the solution, 
and we should not expect fast convergence. A rough way to estimate this distance from the boundary is via the entropy 
of the pseudomarginals, and our experience has shown that the algorithm converges faster when the true pseudomarginal 
distribution has higher entropy. 


C Convexity of the Bethe Free Energy for General Matchings 

In this appendix, we argue that the Bethe free energy for the (not necessarily perfect) matching problem is convex over gen¬ 
eral graphs. The convexity of the Bethe approximation for the bipartite matching problem was investigated experimentally 















by Huang & Jebara ( 2009[ ) and the proven by [Vontobel ( 2013} ). The same argument holds for any choice of reweighting 
parameters such that p, £ [0,1] for all i. For simplicity we only argue the case p, = 1 for all i, the general case is similar 
to Theorem 60 in |Vontobel] ( |2013| . The entropy and polytope approximations are formulated as follows. 

H p( T ) ~ 55 (A + Pj !)(! T ij) l°g(l — T ij) ~ T ij l°g T y 
- 51M 1 - 55 Ti *) 1 °s( 1 - 55 Ti J') 


iGV 


j£di 


j&di 


where r is restricted to T' = {r > 0 : for all* £ V, +j G dir\j < 1}. 
The proof of convexity follows directly from the concavity of the function 


n 

S n {x i, ...,X n ) = ^(1 - Xi) log(l - Xi) - XilogXi 
i= 1 

for each n > 1 ( |Vontobei]|2013| l. 

Theorem C.l. For any p £ [0,1]^ I, any graph (bipartite or general), and any matching (perfect or imperfect), the 
reweighted free energy 0 is convex over the local polytope. 


Proof. For the case of the perfect matching problem on a graph G, the entropy term of the Bethe free energy can be written 
as 


H'y(r) 


55 (i Tij ) log(l - Tij ) - Tij log Tij 

(iJ)eB 
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Here, h(x) = —x\ogx — (1 — x) log(l — x) is the entropy function. As both S and h are concave functions, the entropy 
function is concave which implies that the free energy approximation is convex. □ 


D Frank-Wolfe and Matchings 

In this appendix, we describe a conditional random field over perfect matchings, formulate the approximate learning 
problem in this context, and describe the linesearch procedure used as part of the FW algorithm. 

Assume we have M observations, consisting of N items matched to N other items. We represent the m’th observation 
by {W^ m \ where the W and X are N x D\y and N x Dx data matrices and is an N x N column 

permutation matrix]^ Note that W and X contain the data for the two separate parts of the graph. 

In general, conditional random field features can be arbitrary functions of (W,X,Y). To produce a model whose MAP 
solution is a maximum-weight perfect matching, we require the features to be linear in Y. Since Y v denotes the presence 
or absence of edge (i,j), its coefficient ought to depend only on the data for items i and j. Therefore, we use the feature 
map Fk(W, X , Y) = ( Gk(W 7 X),Y) where ( Gk(W ., X )) j . = gh (Wi , Xj). That is, the fc’th feature is a linear function of 
Y with coefficients given by applying a single function gp : R D - Y x BJ >W to every pair of rows in W and A'. We will 
have K features in total. We now write G^" ■’ = G/ : f X (m, 'j and dispense with W and X. The probability of one 

observation is thus 

p(Y\G 1:K ; 0) = ^ exp ^ Y > 

So the log-likelihood for M i.i.d. observations is 

i (9; Y^ m \G^) = 55 55 e k (g { ™\y^) - log z(o, G$) 

m k 

2 That is, if i maps to j, then Yji = 0 and Yki = 0 for k 7 ^ j. 


( 16 ) 



























We focus on the case K < MN 2 . 


D.l Minimax Formulation 


We now replace log Z with log Zb p and add an L 2 regularizer. Note that p is an iV-vector reweighting parameter with 
entries between 0.5 and 1. Using the variational formulation of the (Bethe) free energy, we write the maximum Bethe 
likelihood problem as a minimax problem, which we further analytically reduce to a convex program with linear constraints. 
Begin with 

- log (0, G<”g) = min - ]T 9 k (Gi m) , t) - H p (T). (17) 

k 

To simplify subsequent derivations, let y( m ' ) = vec = vec (T( m )), and G^ be an N 2 x K matrix whose 

fc’th column is given by vec ^G^"' 1 J. In the sequel, we will use the reweighting parameters in the form of pairwise sums 
Pi + pj. Thus, let R be N x N matrix where Rij = pi + pj and let r = vec (R). Additionally, define y,r, and G by 
vertically stacking all M members of i/( m ) ; T fjn ^. and G^ m \ Thus we can rewrite 0 k (G^ l \Y^ m ^ = 9 T (G T t/). 
Plugging © into ( |~i~6j ) and adding a quadratic penalty gives the problem 

max 0 T ( G T y) ± \\6\\l + £ ^ mm - 0 T H p (r<"*>) 

m 

= max(T (G T y) - ± \\ef 2 + mm, -9 T (G T r) - £ H p (rW) 

m 

= min M max 0 T (G T (y - r)) - ± \\0\\ 2 2 - £ (t«) (18) 

m 

=: min max f (r , 6) 

r eM M e 


The second line is justified because the minimizations in T' 7 "- 1 are separable, so the min and sum operators commute. The 
cost is that we must now minimize over a larger product space _A/f M , but we will see later why this is not a problem. The 
last line follows from Sion’s minimax theorem: the minimization domain M. m is compact convex, and the objective is 
convex in the minimization variable r and concave in the maximization variable 9 (Sion 1958). The theorem requires only 
one compact domain, so 9 can remain unconstrained. Thus, for any r, the concave function / (r, •) attains its maximum 
at the stationary point 0 = Vg/ = G T (y — r) — A 9, e.g. 9 = A ~ X G T (y — r). Moreover, / (r, •) is strictly concave for 
A > 0, so the maximum is unique. Plugging in to ( 18 i and simplifying, we get 


,SiA h i |gT(9 - r) ^ ( T<m| ) 

m 

min h (r) 
t£M m 


(19) 


D.2 Line search 


To compute the next iterate of FW, r t+ i, we use linesearch. Write r t+ \ = (1 — r)r t + Plugging in to (19 1 , we get 

ht(v) '■= ||G T (y - (1 - rf)T t - y t* + 1 )\f - ^ H RW ((1 - y)rj m) + yr*^; p) 

m 

= ^ {II gT (y - n)f + 2? 7 (y - n) T GG T (r t - r t * +1 ) + y 2 || G t (r t 


‘t+lj 




- Hrw (t 1 ~ 7 ?) T t” l) + VR+i^iP) 

m 

Thus we can precompute the expensive matrix products in the quadratic term. 


( 20 ) 


D.3 General Matchings 

The above FW procedure only requires a few changes when switching from complete bipartite graphs to general graphs. 
The same equations and steps hold when we replace biadjacency feature matrices with adjacency features, and permutation 









matrices with matrices representing perfect matchings. There are only a few technical caveats. First, for general graphs 
we need to be able to allow some t, :) to be zero. This can occur because either there is no edge between i and j, or i and j 
are neighbors, but there is no possible perfect matching in which they are linked. For both of these cases, we simply clamp 
Tij at zero. Similarly, some edges occur in every perfect matching, so we need to discover these a-priori and clamp r 7J at 
one. Second, unlike for bipartite matching, initialization of r is non-trivial, since the set of neighbors Nb(i) is different for 
every i. We cannot choose an integral r from the local marginal polytope as an initial point, since the curvature is infinite 
there. Instead, for every edge in the graph, we can find one matching that contains that edge and one matching does not by 
solving a series of matching problems. We average all of these matchings to obtain an initial feasible point. 


E Frank Wolfe and Linear CRFs 

E.l Notation 

We work with a conditional random field of L labels over the graph G = (N, E) in the standard overcomplete parameteri¬ 
zation. That is, y n is an L x 1 indicator vector for the state of node v, and y e is an L x L indicator matrix for the state of 
edge e. We will also treat y e as a vector when convenient. We denote an element of a matrix or vector by parentheses. For 
node n, let u n be its C x 1 feature vector and for edge e, let v, be its D x 1 feature vector. Implicitly, these feature vectors 
are derived from applying some function to an input vector x. We refer to elements of a vector We will learn a linear map 
for the node and edge parameters: 

9 n = Fu n Vn € N 
0 e = Gv e Ve € E 

Now suppose we have M exchangeable samples, and let the superscript • m denote the observation belonging to the mth 
sample. Our joint log-likelihood is thus 

t (F, G; y, u, v) = E fe V% T Fu* + £ y? T Gv)l - log Z (. F , G, u, v) 


E.2 Minimax Formulation 

We replace log Z with a parameterized surrogate likelihood log Z p which interpolates between the TRW and Bethe ap¬ 
proximations. We use the variational formulation of log Z p , over the local polytope T. Note that the Bethe approximation 
is not convex in this setting, but TRW is. For grid MRFs, each edge has probability 0.5 of appearing in a spanning tree, so 
we set p = 0.5 for each edge. 

Since we are estimating matrix parameters, we add a Frobenius penalty. The minimax formulation is thus 

ma*£ [Y.Vn T F< +Y.y™ TGv e) \ ml \ l|G|| 2 F 
+ E (E < Tp <+ E ic t gv?) - h p {p) 

m ^ \ n e ) 

= min max £ ( £ (C - aC) T Fu? + £ (y™ p™) T GvA (21) 

-j iinii-jiicii 

m 

where the reweighted approximate entropy is given by 

-^p(m) * = ^ ^ ^ ^ Pnn'^Ann') 

n£N nn'(zE 

= E! G[{p j n) — El Pnn 1 [H{Pn) + H{Pn’)\ + E Pnn' H(p 

nn ') 

riGAT nG N n'GNe(n) nn'^E 



(22) 


— ^ ' I 1 ^ \ Pnn’ I H (pn) T" ^ ^ Pnn' H (p n n') 

n£N y n'£Ne(n) J nn'£N 

where J(/w) = Yjy n y n , Pnn' (y n , Vn') log [Pnn'iVn, Vn') / Pn(y n ) Pn' (y n ')\ is the mutual information between variables 
n and n' and H(p n ) = - J2 Vrl Pn(y n ) \ogp n (y n ) and H(p nn ') = - H Vn y n , Pnn'(y n -,y n ')\ogp, nn ’(y ni y n ') are single- 
ton and pairwise entropies. We have used the identity I(p n n') = H(p n ) + H(p n >) — H(p nn >). We have implicitly used 
the pairwise marginalization constraints when using the mutual information identity, so these gradients are valid only on 
the local polytope—a fact that is important to remember when optimizing. 


The stationary point of the objective in ( |2T| is thus 

0 = 

=> F = 


E / m , m\ „.raT \ t -> 

\Vn -Pn) u n ~ XF 

mn 

a- 1 E(c-Mn)«; T 

mn 


(23) 


Similarly, G = A -1 ^ me (vT ~ PT) V J- Recalling the definition of the Frobenius norm and rearranging some summa¬ 
tions, we get 

2 


A 

2 


\\F\\l = 


A 

2A2 


E^ 



2A 


EEW-'O 1 {yn'-Pn') 




(24) 


mn m'n' 


On the other hand, the quadratic terms are 
1 
A 


■E^-/C) T E (vx'-rf )** 1 


EEW-tf) T (Vn' -Pn' 


u m ! T u m 
u n f u n 


mn m'n' 

= ^\\F\\ 2 f 


(25) 


e.g. the same Frobenius norm of outer products, by comparison with (|24|>. 


We have eliminated the matrix F (and similarly, G), which reveals that our objective is a quadratic form in the Gram 
matrices UU T and VV 1 , where U is obtained by vertically stacking the u™ and V obtained by vertically stacking the 
v™, so that the entry (UU T ) , , = u™ T u™, and (VV T ) , , = v™ T v(Ji . Let Yv- T/v be the matrices obtained 
whose (mn,£Y th entry is given by p™(£) and V e ,T e be the matrices whose (me, f£')’th entries are given by 

v™(£, £'), p™(£, £')■ The objective is quadratic in (Ye — Te) and (Yjv — T)v), so we can flip them to simplify some signs 
later. Write W = T - Y. Then the objective is 

Tn ™* tm = ^(w n wt,uu t ) + ±(w e wt,vv t )-h p (t n ,Te) 

^ tr (W^UU r W N ) + E tr (WjVV T W E ) - H p (T N , T E ) 

= ± \\U T W N \\ 2 F + ± \\V r W E t - H p (T N ,T E ) (26) 

with a matricized form of the entropy as 


H P (T N ,T E ) = — (1 — R n, Tn o log T n ) — (R e ,T e o log T e ) 


where o denotes elementwise multiplication, and // v, R e are matrices of reweighting parameters conforming to Tjv, T e . 
That is, (R N ) nm . = E„'6Ne(n) Pnn' while (R E ) nep = Pnn'- From this form, the gradients are evidently 

f) M 

= -(l-R N )o(l + logT N ) 

= —Re ° (1 + log T E ) 

0± N 

So the gradients of our objective are 


dh 

dh 

dTy 


A ~ 1 UU T (T n - Y n ) + (1 - R N ) o (1 + log Tjv) 
A” W T (T e - Ye) + Re o (1 + log T E ) 


(27) 


( 28 ) 










E.2.1 Linesearch 


The quadratic terms of h(fi + rid), as a function of 77, are 

— (||C/ T {Wn + r]D N )\\ F + ||U T (We + 77£>e)|| f ^ 

= ^ (\\U r W N + vU t d n \\ 2 f + \\v t w e + i 1 v t d e \\ 2 f ) 

= ^ (\\U t W n \\ 2 f + \\V t We\\ 2 f + v((U t W n ,U t D n ) + (V t W e ,V t D e )) + r, 2 (\\U r D N ||* + ||F t D £ || f )) 

The inner products (■, ■) are matrix inner products of C x L and D x L 2 matrices. For the Bethe/TRW entropy, we will 
just have to treat it as a black box. 

E.2.2 Block-Coordinate Updates 

For block-coordinate Frank-Wolfe, we can update the gradient and perform linesearch without computing the full inner 
products U t Wn and V T W E - Let 77 denote a step size, D m = ( DD F ) denote the step direction for sample to, and 

U m , V m be the submatrices containing only those rows for sample to. More precisely, DD^'j where 
has the same number of rows as U and l) F has the same number of rows as V, but only those rows for sample to are 
nonzero. Suppose we move to the point T + r)D I m l = + rjD^, T E + r)D ^ ^. Then our new inner products are 

U T (t n + V D 1 ™ ] - yjv) = U t W n + r,U mT D m 
V T (ji E + Vd [m] - Ve) = V t W e + r]V mT d m 

Thus we need only add the update terms r)U mT and //l /m 1 f) 7 ^. The algorithm depends on the value of T only through 
these inner products. 

F Experiments for FW-based Inference 


(2013), using code obtained from the authors. In (a) and (b) we plot the distance of the approximate marginals from 
exact marginals computed with brute force v.s. the number of calls to the maximum-matching solver. We use a bipartite 
graph with 10 nodes on each side, i.i.d. edge weights distributed unif[0,1], and inverse temperatures of 10 (a) and 0.25 
(b). In (c), we run each algorithm for a very large number of MAP calls for a range of temperatures in order to identify 
the affect of temperature on the algorithms’ errors, and results are aggregated over 10 random n = 10 graphs. Overall, 
we find that the Bethe approximation provided by FW is substantially more accurate than perturb-and-MAP and that FW 
converges more quickly. However, (c) suggests that changes in temperature affect the algorithms’ approximation accuracies 
differently. 


In Figure [TO] we compare Frank-Wolfe (FW) for bipartite perfect matching to the perturb-and-MAP algorithm of Li et al. 





inverse temperature 


Figure 10: Frank-Wolfe v.s. Perturb-and-MAP 


Our proposed FW algorithm for bipartite perfect matching and the belief propagation algorithm of Huang & Jebara ( |2009j ) 
minimize the same objective over the same polytope, so we focus on the speed-accuracy trade-offs of the algorithms. 
































Zoo error 

.05 

.01 

0.005 

0.001 

rand-20 

3.37 

1.56 

1.12 

0 .25 

rand-50 

18.0 

11.76 

5.18 

0.91 

rand-75 

23.87 

23.87 

9.8 

1.5 

rand-100 

19.24 

19.24 

10.6 

1.75 

lda-20 

1.6 

.61 

.34 

.10 


Table 2: Frank-Wolfe speedup over BP for various error tolerances. 


In Table [2j ’rand-n‘ refers to random complete bipartite graphs with n nodes on each side and i.i.d. edge weights from 
unif[0,1]. The ‘lda-20’ experiment aligns topics from different runs of a Gibbs sampler for an LDA topic model with 20 
topics. All results are averages over 10 graphs. We run the algorithms with a range of termination tolerances in order to 
obtain various speed-accuracy points. Then, for a range of l x distances to the true Bethe marginals, we compute the time 
necessary to achieve the specified error. The table presents the ratio of computation time for BP to FW (ratio > 1 means 
FW is faster). As expected from an algorithm that is no faster than O(j), we find that it gives good accuracy quickly, but 
is slow to converge to within very tight error tolerances. Such a speed-accuracy trade off affects other first order methods 
such as SGD, but can still be advantageous in many cases, including large-scale applications or when optimizing beyond 
the statistical error of the problem is pointless. 
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